COVID-19 and US Intrastate Travels

Overview

The goal of this project is to identify whether intrastate travel has any effect on the increase of COVID-19 rates per state. We clean and merge two datasets, one containing the daily COVID-19 rates per state and the other summarizing intrastate travel per state. We focus on daily positive increases as a function of intrastate travel, and explore the data using both visual and predictive analysis. Despite analyzing the data from multiple angles, our models predict that there is no direct correlation between the spread of COVID-19 and the number of trips within a state.

Note: Please download the HTML notebook file (or run this file) on your local machine to see the visualizations properly.

Names

Research Question

"Can the new daily COVID-19 cases per state be explained by the number and distance of travel in the US?"

Background and Prior Work

Between August 2nd and September 5th, 2020, the CDC reported that weekly COVID-19 cases among individuals aged 18 to 22 increased by 55% across the nation [1]. This has resulted in warnings to Americans against international travel, along with calls for enhanced safety and health measures in certain states (cdc.gov). The CDC has continually informed the public that traveling increases the chance of becoming infected with and spreading the coronavirus, and suggests that staying at home is the best preventive measure. However, many individuals have not yet taken such measures against spreading the virus seriously and still travel [2]. Therefore, to raise public awareness and offer a scientific analysis of the necessity of maintaining social distance, we examine the correlation between the spread of COVID-19 and travel mathematically through data algorithms [3].

Various studies have proposed algorithms that model how different viruses spread or that predict a disease's spread rate. For example, Wu and Leung [4] established a framework used to estimate the possible local and international spread of the coronavirus.

In this data analysis, we examine the correlation between travel between cities and new confirmed cases of COVID-19 in the United States. We probe the impact of air and urban traffic, in terms of passenger population and number of flights, on new COVID-19 cases. The significance of this data is that, despite the lack of comprehensive information on the number of trips Americans take, a relationship between the cumulative number of trips and the spread of COVID-19 cases can be tested.

References (include links):

Hypothesis

Our primary hypothesis is that the number of new daily COVID-19 cases is positively correlated with the number of trips, and we hope to illustrate that less travel results in fewer cases. We also predict that the distance of travel is positively correlated with the number of new daily COVID-19 cases, though more weakly than the total number of trips in that state.

We believe this because COVID-19 is a respiratory disease known to be transmitted mainly through respiratory droplets produced when an infected person coughs, sneezes, or talks. These droplets can land in the mouths or noses of people who are nearby or possibly be inhaled into the lungs. With that in mind, we hypothesize that more trips and longer trips lead to a higher chance of unintentional COVID-19 spread.

Datasets

Our first dataset is "Trips by Distance" from the US Department of Transportation, Bureau of Transportation Statistics. The dataset contains the number of trips separated by distance in miles (1-3, 3-5, 5-10, etc.) and by county/state, starting on January 1st, 2019. Since the COVID-19 dataset starts on 1/22/2020, we use only a partial date range (1/22/2020 - 10/10/2020).

Trips are defined as movements that include a stay of longer than 10 minutes at an anonymized location away from home. Trips capture travel by all modes of transportation, including driving, rail, and air. This dataset uses mobile device geolocation as an indicator of movement. It doesn't distinguish between modes of transportation; however, we consider this more significant, as it captures motion that would otherwise be uncollectable.

We downloaded this dataset on October 10th, 2020.

https://data.bts.gov/Research-and-Statistics/Trips-by-Distance/w96p-f2qv

Our second dataset comes from Kaggle: us_states_covid19_daily.csv. This dataset shows the number of new COVID-19 cases per day for each US state.

https://www.kaggle.com/sudalairajkumar/covid19-in-usa?select=us_covid19_daily.csv

We downloaded both datasets on October 10th, 2020.

Setup

Data Cleaning

First we will focus on cleaning the Trips by Distance dataset. While it contains multiple levels (county, state, national), we are only interested in the state level. Then we will merge it with the Daily COVID dataset. Lastly, we will break the merged data up per state for easier manipulation and to minimize one-hot encoding later.
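The filtering and merging steps described above could look roughly like the following sketch. The column names (`Level`, `Date`, `State Postal Code`, `date`, `state`, `positiveIncrease`), the date formats, and the tiny in-memory frames are assumptions for illustration; the real CSV headers should be checked before reusing any of this.

```python
import pandas as pd

# Toy stand-in for the Trips by Distance data (column names are assumed).
trips = pd.DataFrame({
    "Level": ["National", "State", "State", "County"],
    "Date": ["2020/03/01", "2020/03/01", "2020/03/02", "2020/03/01"],
    "State Postal Code": [None, "CA", "CA", "CA"],
    "Number of Trips": [1_000_000, 90_000, 85_000, 4_000],
})

# Toy stand-in for the daily COVID data (column names are assumed).
covid = pd.DataFrame({
    "date": [20200301, 20200302],
    "state": ["CA", "CA"],
    "positiveIncrease": [5, 8],
})

# Keep only state-level rows and parse both date columns to datetimes.
state_trips = trips[trips["Level"] == "State"].copy()
state_trips["Date"] = pd.to_datetime(state_trips["Date"], format="%Y/%m/%d")
covid["date"] = pd.to_datetime(covid["date"].astype(str), format="%Y%m%d")

# Inner-join on (date, state) so each row pairs trips with new cases.
merged = state_trips.merge(
    covid,
    left_on=["Date", "State Postal Code"],
    right_on=["date", "state"],
    how="inner",
).drop(columns=["date", "state", "Level"])

# Split into one frame per state for later per-state modelling.
per_state = {code: frame for code, frame in merged.groupby("State Postal Code")}
```

An inner join is used here so that only dates present in both sources survive; a left join would be the alternative if trip rows without COVID records should be kept.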

The first case of COVID-19 in the USA was on 1/20/2020, but since our other dataset starts on 1/22/2020, we will use that date instead.

We decided that combining the 10 "number of trips" columns into 3 groups will be more efficient for our analysis. The first group sums the number of trips ranging from less than 1 mile to 25 miles (inclusive). The second group sums the number of trips ranging from 25 miles (exclusive) to 250 miles, and the last group sums those from 250 miles (exclusive) to 500 miles.
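As a rough illustration of this grouping, the sketch below sums distance-bin columns into the three ranges described. The bin column names and the new column names (`trips_upto_25`, `trips_25_250`, `trips_250_500`) are assumptions, not the notebook's actual identifiers, and any bin above 500 miles is simply left out of this example.

```python
import pandas as pd

# Assumed column names for the distance bins up to 500 miles.
SHORT = ["Number of Trips <1", "Number of Trips 1-3", "Number of Trips 3-5",
         "Number of Trips 5-10", "Number of Trips 10-25"]
MEDIUM = ["Number of Trips 25-50", "Number of Trips 50-100",
          "Number of Trips 100-250"]
LONG = ["Number of Trips 250-500"]


def group_trip_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the distance bins summed into three groups."""
    out = df.copy()
    out["trips_upto_25"] = out[SHORT].sum(axis=1)
    out["trips_25_250"] = out[MEDIUM].sum(axis=1)
    out["trips_250_500"] = out[LONG].sum(axis=1)
    # Drop the original bins once they have been aggregated.
    return out.drop(columns=SHORT + MEDIUM + LONG)


# Toy row: every bin holds 10 trips.
toy = pd.DataFrame({col: [10] for col in SHORT + MEDIUM + LONG})
grouped = group_trip_columns(toy)
```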

Next, we want to prepare the COVID-19 dataset.

Summary of Data Cleaning

Both datasets were standardized in their values and date formatting. We had an easy time merging them and extracting only what we were interested in for analysis.

Pre-processing

We didn't have to do any pre-processing on our datasets, as both CSV files were collected by their respective sources.

Columns Dropped

From the Trips By Distance dataset, we dropped: "Level", "State FIPS", "County FIPS", and "County Name".

From the Daily COVID 19 dataset, we dropped: 'index', 'negative', 'pending', 'totalTestResults', 'hospitalizedCurrently', 'hospitalizedCumulative', 'inIcuCurrently', 'inIcuCumulative', 'onVentilatorCurrently', 'onVentilatorCumulative', 'recovered', 'dataQualityGrade', 'lastUpdateEt', 'dateModified', 'checkTimeEt', 'death', 'hospitalized', 'dateChecked', 'totalTestsViral', 'positiveTestsViral', 'negativeTestsViral', 'positiveCasesViral', 'deathConfirmed', 'deathProbable', 'totalTestEncountersViral', 'totalTestsPeopleViral', 'totalTestsAntibody', 'positiveTestsAntibody', 'negativeTestsAntibody', 'totalTestsPeopleAntibody', 'positiveTestsPeopleAntibody', 'negativeTestsPeopleAntibody', 'totalTestsPeopleAntigen', 'positiveTestsPeopleAntigen', 'totalTestsAntigen', 'positiveTestsAntigen', 'fips', 'negativeIncrease', 'total', 'totalTestResultsSource', 'totalTestResultsIncrease', 'posNeg', 'deathIncrease', 'hospitalizedIncrease', 'hash', 'commercialScore', 'negativeRegularScore', 'negativeScore', 'positiveScore', 'score', and 'grade'

Columns Merged

We merged several columns of the Trips By Distance dataset into four groupings for analysis based on distance range.

Columns Added

We added state regions, which will later help with plotting. We also added ratio, which is the proportion of a state's population that is not staying at home.
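One plausible way to compute the ratio column is sketched below. It assumes the trips data exposes two columns named `Population Staying at Home` and `Population Not Staying at Home`, and that ratio is the second divided by their sum; both the names and the exact formula are assumptions.

```python
import pandas as pd

# Toy data; the two population column names are assumptions.
df = pd.DataFrame({
    "Population Staying at Home": [2_000, 1_000],
    "Population Not Staying at Home": [8_000, 3_000],
})

# Share of the population that left home on a given day.
total_population = (df["Population Staying at Home"]
                    + df["Population Not Staying at Home"])
df["ratio"] = df["Population Not Staying at Home"] / total_population
```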

Rows Dropped

We didn't drop any rows due to null values. Instead, we converted them to zeroes, as they only occurred in states before any COVID-19 cases were reported.
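A minimal sketch of this null handling, using a made-up three-row frame with assumed column names (`positive`, `positiveIncrease`):

```python
import numpy as np
import pandas as pd

# Toy frame: early dates have no reported values yet.
covid = pd.DataFrame({
    "state": ["WY", "WY", "WY"],
    "positive": [np.nan, np.nan, 3.0],
    "positiveIncrease": [np.nan, 0.0, 3.0],
})

# Count rows that contain at least one null before filling.
null_rows_before = int(covid.isna().any(axis=1).sum())

# Replace nulls with zero instead of dropping the rows.
covid = covid.fillna(0)
```

Filling with zero keeps the date range continuous for every state, which matters for any later lag-based features.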

Other Information Dropped

From the Trips by Distance dataset, we dropped 2059493 rows, as they were at either the county or national level and we only cared about the state level and intrastate travel. We note there will be some loss in granularity of our data, but our Daily COVID dataset doesn't describe the county level, so the additional data is irrelevant.

Descriptive Analysis

In summary, we have a large dataset covering trips and daily COVID-19 cases in the 50 states plus Washington, DC between January 22 and October 10, 2020.

Exploratory Data Analysis

First, we will look at a heatmap that shows correlation between the distance traveled and positive increase variables.

For each state, we will drop any row whose positive increase is not greater than 0. A negative value makes no mathematical sense, as a state cannot report a negative number of new positive COVID-19 cases. We believe that these were data entry errors, so we'll omit them from our analysis.

Let's take an initial look at new COVID-19 cases per day at a national level.
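The row filter described here can be sketched as a boolean mask. The column name `positiveIncrease` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy data with one negative and one zero entry.
covid = pd.DataFrame({
    "state": ["NY", "NY", "NY", "CA"],
    "positiveIncrease": [120, -15, 0, 40],
})

# Keep only rows with strictly positive daily increases.
cleaned = covid[covid["positiveIncrease"] > 0].reset_index(drop=True)
dropped = len(covid) - len(cleaned)
```

Note that a strict `> 0` filter also removes days with zero new cases; using `>= 0` instead would drop only the negative entries.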

We will distinguish colors by regions for easy inference. Code inspired by https://stackoverflow.com/questions/26139423/plot-different-color-for-different-categorical-levels-using-matplotlib. Color inspired by https://in.pinterest.com/pin/773141461014013480/?d=t&mt=signup

In previous steps, we classified states into 4 groups: M, N, S, and W, which stand for the Midwestern, Northern, Southern, and Western states of the USA.

Immediately we can notice there are (unfortunately) three peaks: one early in the North region, one later in the South and West regions, and lastly one in what looks like all four regions. The North region also has one extremely negative outlier, while the South region has several positive outliers.

Below is a plotly interactive visualization of the daily COVID-19 cases in each state. Please don't forget to download and run this notebook to view it properly.

Let's next look at how the total number of trips has changed over the same time range.

States in the same regional category (M, N, S, W) share the same color. The pattern is a little concerning for our hypothesis: aside from an initial decrease in total trips between March and April for certain states, there seems to be little to no change in total trips in the subsequent months, while there were two peaks in COVID-19 increases (ignoring the lone spike).

Below is another plotly interactive visualization, this time of the number of daily trips in each state.

Because US states like California and Texas are highly populous and hence have a higher number of trips, this visualization doesn't depict the trips of less populous states as intended. Hover over each state to see its actual numbers increase or decrease.

Let's plot the relationship between Number of Trips and Positive Increase. If the Number of Trips doesn't change very much but new positive cases do, then it will be hard to establish a relationship between the two.

There seems to be some positive correlation; however, there are clear differences even within regions. There is also the issue that the far right of the graph contains a lot of trips with zero new positive COVID-19 cases. This is explained by the large number of trips in the West early in the year, while COVID-19 cases only increased in the latter half of the year. Another explanation for the variability is that states do not have equal populations or average trip counts, with more populous states (e.g., California) skewing the data. To avoid regional and national bias, we will now analyze everything at a state level.

As seen in the graph, the relationship between trips and new COVID-19 cases varies widely from state to state. Some of this is due to how most states only started to see COVID-19 increases several months after we started tracking.

Interestingly, many states have an overall negative correlation with total trips, but at a distance of 25-250 miles the majority of states have a positive correlation for trips, with some of the positive correlation lingering in longer trips.
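A per-state correlation like the one discussed above can be computed by grouping the merged data by state. This is only a sketch on made-up numbers; the column names `state`, `total_trips`, and `positiveIncrease` are assumptions.

```python
import pandas as pd

# Toy data: trips rise with cases in one state and fall in the other.
df = pd.DataFrame({
    "state": ["CA"] * 4 + ["NY"] * 4,
    "total_trips": [10, 20, 30, 40, 10, 20, 30, 40],
    "positiveIncrease": [1, 2, 3, 4, 8, 6, 4, 2],
})

# Pearson correlation between trips and new cases, computed per state.
state_corr = pd.Series({
    state: group["total_trips"].corr(group["positiveIncrease"])
    for state, group in df.groupby("state")
})
```

The same loop can be repeated for each distance-range column to compare how the sign of the correlation changes with trip length.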

Data Analysis and Results

One of the challenges for our hypothesis is that each state exhibits different seasonality in COVID-19 cases, which is expected considering the Northeast was an early hotspot for the virus. This means that we will have to fit models at a state level and compare their accuracy scores at an aggregate level.

Positive Increase as a Function of Trips sorted into Thresholds (Distance of trips)

Total trips cannot be included as a predictor, since it is perfectly collinear with the distance-threshold variables (it is their sum).

As seen above, the majority of the models fail to capture any meaningful relationship, and the ones with moderate accuracy (e.g., WV) have extremely small coefficients on the predictors.
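The collinearity problem can be checked numerically. The sketch below uses made-up counts and assumes total trips is exactly the sum of the distance-group columns; adding it to the design matrix then adds a column without raising the matrix rank.

```python
import numpy as np

# Toy counts for the three distance groups (rows are days).
short = np.array([50.0, 60.0, 55.0, 70.0])
medium = np.array([10.0, 12.0, 9.0, 15.0])
long_ = np.array([1.0, 3.0, 2.0, 2.0])
total = short + medium + long_  # exact linear combination of the others

X_groups = np.column_stack([short, medium, long_])
X_with_total = np.column_stack([short, medium, long_, total])

# Rank stays the same even though a fourth column was added,
# so the regression coefficients would not be uniquely determined.
rank_groups = np.linalg.matrix_rank(X_groups)
rank_with_total = np.linalg.matrix_rank(X_with_total)
```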

Positive Increase as a function of Ratio not Staying Home and Total Trips

Since the trip distance thresholds don't make a difference, we will consolidate them into one variable. We will also introduce the ratio variable, which calculates the proportion of a state's population that is not staying home.

This model performed only marginally better than the previous one. The coefficients on number of trips are still close to zero. What is surprising, however, is that the ratio of people not staying home has a negative effect on positive increases in cases.

States in the same regional category (M, N, S, W) share the same color. There is a general decrease in the ratio not staying at home while positive cases are increasing, which could explain why the coefficient is negative. Regardless, this model is also unsatisfactory in establishing a correlation.

Again, the highly populous states make the less populous states' data look static. Hover over each state to see its actual numbers increase or decrease.

Positive Increase as a Time Series Analysis with number of trips (nTrips) and Ratio not Staying Home

Since the data relies heavily on time, we will treat it like a stock price analysis, where today's gains have some dependence on yesterday's gains.

This model works really well in some states and terribly in others. Let's see whether this can be a function of region, given that it looked like region had an effect on Ratio earlier.

Below is the State of California's OLS regression result.

It doesn't look like region has anything to do with the accuracy of the model. However, the high average accuracies combined with the small p-values for yesterday's gains show that this model captures a lot of the variance for most states.
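A lagged predictor of this kind can be built with a per-state shift. The sketch below is not the notebook's model: it uses made-up numbers, assumed column names (`positiveIncrease`, `nTrips`, `yesterday`), and NumPy's least-squares solver as a stand-in for whichever OLS routine was actually used.

```python
import numpy as np
import pandas as pd

# Toy series for a single state.
df = pd.DataFrame({
    "state": ["CA"] * 6,
    "positiveIncrease": [10.0, 12.0, 15.0, 19.0, 24.0, 30.0],
    "nTrips": [100.0, 98.0, 101.0, 99.0, 102.0, 100.0],
})

# Yesterday's increase within each state; the first day has no predecessor.
df["yesterday"] = df.groupby("state")["positiveIncrease"].shift(1)
model_df = df.dropna(subset=["yesterday"])

# Ordinary least squares: positiveIncrease ~ 1 + yesterday + nTrips.
X = np.column_stack([
    np.ones(len(model_df)),
    model_df["yesterday"].to_numpy(),
    model_df["nTrips"].to_numpy(),
])
y = model_df["positiveIncrease"].to_numpy()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# R^2 of the fitted model (valid because the model has an intercept).
residuals = y - X @ coef
r_squared = 1 - residuals.var() / y.var()
```

Shifting inside `groupby("state")` keeps one state's last day from leaking into the next state's first day when several states share a frame.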

Ethics & Privacy

The datasets we are using have been stripped of personal identifiers and have been publicly shared by the COVID 19 Project and NY Times, so privacy concerns did not apply. With that being said, we are cognizant that our data includes deaths and ventilator usage, and while these are not the direct topic of our research, we will keep these implications at the forefront of our minds. Our intention with this question is to identify how interstate travel is correlated with COVID-19 cases in order to give policymakers more information for their decision making, not to put blame on those who travel interstate. The data we found is generated by volunteers who collect publicly shared information by county, hospital, and state and then compile it into a central database. We expect the data to be generally unbiased, as the data collection is nondiscriminatory; however, it may underrepresent the 'true' figures due to a lack of self-reporting by individuals, delays in the updating of records, and a lack of official government support.

We also understand that interstate travel is the primary source of income for many individuals (e.g., delivery truck drivers), and we do not intend to presume that they should be held responsible for the number of COVID-19 cases found as a result of our research. Additionally, if our research does lead to interstate travel restrictions, we hope that those whose incomes are reliant upon interstate travel are properly supported and that no blame is placed on them.

Again, our goal is to help inform future policy changes and raise awareness regarding transmission rates by travel, rather than place fault upon any individual.

Conclusion & Discussion

The general understanding of how COVID-19 spreads is that direct, indirect, and/or close interaction with those who have COVID-19 is the most common route of transmission. Our study concludes that the association between the number of trips and the number of individuals who tested positive for COVID-19 most likely boils down to a number of confounding variables that we did not or could not fully investigate. Although we may have found certain correlations between the number of trips and the number of individuals with a positive result for COVID-19, all of these correlations were rather weak when we further analyzed and investigated the data.

Due to the high variability of populations between states, we found that comparing the number of trips across all states resulted in exaggerated numbers of trips for states with higher populations. Therefore, to mitigate this and further analyze the data, we decided to instead investigate the relationship between the population of the state and the number of COVID-19 cases. By doing this, we found that the number of people who tested positive for COVID-19 correlated significantly with the population of the state. Therefore, we ultimately decided to examine our data at a state level in order to avoid any regional and national biases.

For our study, we used models such as linear regression to perform various means of data analysis. Within our data analysis, the number of trips and the new cases of COVID-19 were heavily scrutinized. However, we soon realized that much of our data, specifically between the months of March and April, was more than likely biased, as COVID-19 testing was not widely available at that time of the year. While we assumed that the trips data from the Bureau of Transportation Statistics accurately reflected actual travel, we understood that the daily COVID-19 case data likely wasn't as accurate due to limited testing availability and various state/local mandates. We believe that this data discrepancy was one of the factors that led to our results being inconclusive.

Overall, our models predict that there is no direct correlation between the spread of COVID-19 and the number of trips within a state. However, we found that the data we collected is heavily reliant on time. After accounting for yesterday's positive increases, the model gained in accuracy, but the coefficient on the total number of trips remained close to zero. This suggests that the number of trips has little to no measurable effect on new positive cases in our models. While we couldn't explicitly test for it, we would propose a new hypothesis that traveling with social distancing measures (e.g., face masks) has no effect on new cases; testing it may help with future policy guidelines. In the future, as more data becomes available, we hope to build a model that provides meaningful insights. Additionally, we recognize that the ever-increasing threat of COVID-19 poses a serious danger to the international community. Multiple studies have been conducted to search for ways in which outbreaks of the virus can be prevented [1]. We believe that this should serve as a foundation of hope for the eventual eradication of COVID-19.

[1] Frisan, T. (2020). Faculty Opinions recommendation of Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2). Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature. doi:10.3410/f.737557783.793572927.

Team Contributions

Taylor

Luis

Seoyoung

Mitchell